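Traditional fake-video detection methods output a probability that an image has been tampered with, or a mask of the suspected region. However, such unexplainable results cannot serve as convincing evidence, so it is better to trace fake videos back to their sources. Traditional hashing methods are used to retrieve semantically similar images and cannot discriminate the nuances between images. Specifically, compared with traditional video retrieval, source tracing faces the challenge of finding the real source among similar source videos. We design a novel multi-granularity hash loss to handle videos of people that are very similar: the same scene from different angles, or similar scenes with the same person. We propose Vision Transformer based models named Video Tracing and Tampering Localization (VTL). In the first stage, we train the hash centers with ViTHash (VTL-T). A fake video is then fed to ViTHash, which outputs a hash code, and the hash code is used to retrieve the source video from the hash centers. In the second stage, the source video and the fake video are fed to a generator (VTL-L), and the suspect regions are masked to provide auxiliary information. In addition, we construct two datasets: DFTL and DAVIS2016-TL. Experiments on DFTL clearly show the advantage of our framework in tracing similar videos. In particular, VTL also achieves performance comparable to state-of-the-art methods on DAVIS2016-TL. Our source code and datasets have been released on GitHub: https://github.com/lajlksdf/vtl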
translated by Google Translate
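The retrieval step of the first stage (a hash code looked up against stored hash centers) can be pictured as a nearest-neighbor search in Hamming space. The following is a minimal sketch under stated assumptions, not the VTL implementation: hash codes are modeled as small Python integers, and the names `hamming`, `retrieve_source`, the 8-bit codes, and the video IDs are all hypothetical.

```python
from typing import Dict, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def retrieve_source(query_code: int, hash_centers: Dict[str, int]) -> Tuple[str, int]:
    """Return (video_id, distance) for the hash center closest to the query code."""
    best_id = min(hash_centers, key=lambda vid: hamming(query_code, hash_centers[vid]))
    return best_id, hamming(query_code, hash_centers[best_id])


# Hypothetical 8-bit hash centers, one per known source video.
hash_centers = {
    "source_a": 0b10110010,
    "source_b": 0b01001101,
    "source_c": 0b11110000,
}

# Hypothetical code produced for a fake video; it differs from source_a in one bit.
fake_code = 0b10110110
match = retrieve_source(fake_code, hash_centers)
print(match)
```

A real system would use much longer codes and an index structure instead of a linear scan; the linear scan is used here only to keep the idea visible.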
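In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of MBR-based methods are compromised: the MBR criterion is used only in system training, which creates a mismatch between training and decoding; and the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and in slow training. To this end, this work proposes novel algorithms that integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: attention-based encoder-decoders (AEDs) and neural transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method maintains consistency between training and decoding, avoids the on-the-fly decoding process, and trains from randomly initialized models with superior training efficiency. Experiments show that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant improvements across various frameworks and on datasets ranging from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on the Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.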
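For orientation, the MMI criterion that LF-MMI builds on is usually written in the sequence-level form below. This is the textbook objective, not a formula taken from the abstract; the notation is an assumption made here, with $\mathbf{O}_u$ the acoustic features of utterance $u$, $\mathbf{W}_u$ its reference transcript, and $\theta$ the model parameters.

```latex
\mathcal{F}_{\mathrm{MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta(\mathbf{O}_u \mid \mathbf{W}_u)\, P(\mathbf{W}_u)}
         {\sum_{\mathbf{W}} p_\theta(\mathbf{O}_u \mid \mathbf{W})\, P(\mathbf{W})}
```

In the lattice-free variant, the denominator sum is typically evaluated over a compact denominator graph rather than over per-utterance word lattices. How the abstract's method folds this score into E2E training and decoding is not specified there and is not shown here.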
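Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level LMs (e.g., characters, BPE) are usually used with current E2E ASR systems. However, subword-level LMs ignore word-level information, which may limit the strength of external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, these methods are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which each character sequence can correspond to multiple word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on the fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both attention-based encoder-decoder (AED) and neural transducer (NT) frameworks. Experiments show that our method consistently outperforms subword-level LMs, including N-gram LMs and neural network LMs. We achieve state-of-the-art results on the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and reduce CER by 14.8% on a 21k-hour Mandarin dataset.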
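The point that one Mandarin character sequence can map to several word sequences can be made concrete with a small lattice. The sketch below is illustrative only: it builds the lattice for a complete string instead of incrementally per partial hypothesis, and it omits the intersection with a word N-gram LM. The function names, the toy lexicon, and the rule that single characters are always allowed as words are assumptions made here.

```python
from typing import Dict, List, Set, Tuple

Arcs = Dict[int, List[Tuple[int, str]]]


def build_word_lattice(chars: str, lexicon: Set[str], max_len: int = 4) -> Arcs:
    """Map each start position to the (end, word) arcs permitted by the lexicon."""
    arcs: Arcs = {i: [] for i in range(len(chars))}
    for start in range(len(chars)):
        for end in range(start + 1, min(len(chars), start + max_len) + 1):
            word = chars[start:end]
            # Assumption: single characters are always valid words,
            # so every position in the lattice stays reachable.
            if word in lexicon or end == start + 1:
                arcs[start].append((end, word))
    return arcs


def word_sequences(chars: str, arcs: Arcs, start: int = 0) -> List[List[str]]:
    """Enumerate every path through the lattice as a list of words."""
    if start == len(chars):
        return [[]]
    results = []
    for end, word in arcs[start]:
        for tail in word_sequences(chars, arcs, end):
            results.append([word] + tail)
    return results


# Toy lexicon: the same four characters admit several segmentations.
lexicon = {"研究", "研究生", "生命"}
text = "研究生命"
lattice = build_word_lattice(text, lexicon)
paths = word_sequences(text, lattice)
for path in paths:
    print(" / ".join(path))
```

Enumerating paths is exponential in general; it is done here only because the example is tiny. Scoring a lattice against an N-gram LM is normally done with dynamic programming over the arcs instead.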
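Recently, end-to-end (E2E) frameworks have achieved remarkable results on various automatic speech recognition (ASR) tasks. However, lattice-free maximum mutual information (LF-MMI), one of the discriminative training criteria that shows superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach that integrates the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages. The approach shows its effectiveness on two of the most widely used E2E frameworks: attention-based encoder-decoders (AEDs) and neural transducers (NTs). Experiments show that introducing the LF-MMI criterion consistently leads to significant performance improvements across various datasets and different E2E ASR frameworks. Our best model achieves a competitive CER of 4.1%/4.4% on the Aishell-1 dev/test sets; we also achieve significant error reduction over strong baselines on the Aishell-2 and Librispeech datasets.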
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
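The idea of adding a trigger to raw data before distillation can be sketched as a simple patch-stamping step. This is a generic illustration under stated assumptions, not the paper's NAIVEATTACK code: images are plain nested lists of floats, the trigger is a solid square in the bottom-right corner, and poisoned samples are relabeled to a target class. The function names, patch size, target label, and poisoning rate are all hypothetical.

```python
import copy
from typing import List, Tuple

Image = List[List[float]]


def add_trigger(image: Image, size: int = 2, value: float = 1.0) -> Image:
    """Return a copy of the image with a size x size patch set to `value`
    in the bottom-right corner."""
    stamped = copy.deepcopy(image)
    height, width = len(stamped), len(stamped[0])
    for r in range(height - size, height):
        for c in range(width - size, width):
            stamped[r][c] = value
    return stamped


def poison(images: List[Image], labels: List[int], target_label: int,
           rate: float, size: int = 2) -> Tuple[List[Image], List[int]]:
    """Stamp the trigger on the first `rate` fraction of samples and relabel them.
    Relabeling to a target class is an assumption borrowed from common backdoor setups."""
    n_poison = int(len(images) * rate)
    out_images, out_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i < n_poison:
            out_images.append(add_trigger(img, size))
            out_labels.append(target_label)
        else:
            out_images.append(copy.deepcopy(img))
            out_labels.append(lab)
    return out_images, out_labels


# Four blank 4x4 "images"; half of them get the trigger and the target label 9.
clean = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
poisoned_images, poisoned_labels = poison(clean, [0, 1, 2, 3], target_label=9, rate=0.5)
print(poisoned_labels)
```

The poisoned set would then be handed to a dataset distillation procedure in place of the clean data. The iterative trigger updates that the abstract attributes to DOORPING are not modeled here.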
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves is even more so, given how little precedent exists in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset in the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
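One common way to turn a support mask into a class center is masked average pooling: average each feature channel over the pixels the mask selects. Whether RefT computes its dynamic class centers exactly this way is not stated in the abstract, so the sketch below is only a generic illustration. The function name and the tiny 2x2 feature map are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Mask = List[List[int]]                # indexed as [row][col], values 0 or 1


def masked_average_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average each channel over the pixels where the mask is 1."""
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("mask selects no pixels")
    center = []
    for channel in features:
        total = sum(
            value * keep
            for feat_row, mask_row in zip(channel, mask)
            for value, keep in zip(feat_row, mask_row)
        )
        center.append(total / area)
    return center


# Two channels over a 2x2 grid; the mask keeps the two diagonal pixels.
support_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
support_mask = [[1, 0], [0, 1]]
class_center = masked_average_pool(support_features, support_mask)
print(class_center)
```

The resulting vector has one entry per channel and could serve as a per-class weight on query features; the re-weighting itself is left out because the abstract does not describe its form.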
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle this problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
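The "student mimics teacher" objective in knowledge distillation is often implemented as a KL divergence between temperature-softened output distributions. The sketch below shows that standard soft-label loss for a single node's logits. It is not RELIANT and contains no fairness term; the function names, the temperature of 2.0, and the example logits are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2


# Hypothetical logits for one node over three classes.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]
print(kd_loss(student_close, teacher), kd_loss(student_far, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is also why a student trained this way can inherit whatever bias the teacher's outputs carry.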
This paper focuses on designing efficient models with few parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of EMO over state-of-the-art methods, e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1, surpassing SoTA CNN-/Transformer-based models while balancing model accuracy and efficiency well.